20  Geographic data analysis

We’re going to continue our work on geographic analysis by looking at how these methods can help us answer a slightly harder question.

Our motivating question for this chapter is this: how many homicides occur near schools in Washington, D.C.?

20.1 Loading libraries

Load libraries as usual.

#install.packages('sf')
#install.packages('leaflet')

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.4
✔ tidyr   1.3.0     ✔ stringr 1.5.1
✔ readr   2.1.4     ✔ forcats 0.5.2
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(leaflet)
Warning: package 'leaflet' was built under R version 4.2.3

20.2 Data

We’re loading two dataframes to answer our question.

  • Washington Homicides | spatial points | location and details about each Washington homicide between 2007 and 2016. Details on the data: https://github.com/washingtonpost/data-homicides

  • Washington Schools | spatial points | locations of each Washington, D.C. school. https://nces.ed.gov/programs/edge/geographic/schoollocations

  • Washington D.C. shapefile | shapefile | outline of Washington, D.C.

###
# Load dataframe of washington homicides
###

washington_homicides <- read_rds("assets/data/washington_homicides.rds")

###
# Load dataframe of washington schools
###

washington_schools <- read_rds("assets/data/washington_schools.rds")

###
# Load outline of DC
###

washington_shapefile <- read_rds("assets/data/washington_shapefile.rds")

Here are the 239 schools in Washington.

ggplot() +
  geom_sf(data=washington_shapefile, fill="white") + 
  geom_sf(data=washington_schools, color="purple")

Here are the 1345 homicides in Washington between 2007 and 2017.

ggplot() +
  geom_sf(data=washington_shapefile, fill="white") + 
  geom_sf(data=washington_homicides, color="red")

And here are both on a map.

ggplot() +
  geom_sf(data=washington_shapefile, fill="white") + 
  geom_sf(data=washington_homicides, color="red") +
  geom_sf(data=washington_schools, color="purple") 

Let’s think back to our question: how many homicides occurred near schools?

In the prior lab, we used something called a “spatial join” to tie a school – a geographic point – to a Census tract – a geographic polygon.

We could try something similar here. The code below says: create a new object called schools_homicides by attempting to connect each point in our schools table with each point in our homicides table. If there’s an overlap, return that match to our new table.

schools_homicides <- washington_schools %>%
  st_join(washington_homicides, left="FALSE")

schools_homicides
Simple feature collection with 0 features and 15 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  NAD83
# A tibble: 0 × 16
# ℹ 16 variables: school_name <chr>, address <chr>, city.x <chr>,
#   state.x <chr>, zip <chr>, geometry <GEOMETRY [°]>, uid <chr>,
#   reported_date <dbl>, victim_last <chr>, victim_first <chr>,
#   victim_race <chr>, victim_age <chr>, victim_sex <chr>, city.y <chr>,
#   state.y <chr>, disposition <chr>

Notice that it returned zero rows. That’s because for there to be a match, a school point had to be essentially on top of a shooting point. The points are very small, and it’s unlikely there will be an exact match in our data.

We can fix that by making the size of the point bigger, by converting it to a polygon. How do we define “near” for the purpose of our question? Let’s use “within 100 meters” of the centerpoint of the school as our definition.

We’re going to use a new function st_buffer to create a “buffer” around our latitude longitude point. We’re going to change our geometry from a small point to a circle with a radius of 100 meters centered on that point.

washington_schools_buffered <- washington_schools %>%
  mutate(geometry = st_buffer(geometry, dist=100))

washington_schools_buffered
Simple feature collection with 239 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -77.10182 ymin: 38.82285 xmax: -76.91885 ymax: 38.98553
Geodetic CRS:  NAD83
# A tibble: 239 × 6
   school_name               address city  state zip                    geometry
 * <chr>                     <chr>   <chr> <chr> <chr>             <POLYGON [°]>
 1 Cesar Chavez PCS for Pub… 3701 H… Wash… DC    20019 ((-76.94809 38.90084, -7…
 2 Cesar Chavez Public Char… 3701 H… Wash… DC    20019 ((-76.94809 38.90084, -7…
 3 Friendship PCS - Collegi… 4095 M… Wash… DC    20019 ((-76.94558 38.89821, -7…
 4 Friendship PCS - Southea… 645 MI… Wash… DC    20032 ((-76.99599 38.84797, -7…
 5 Friendship PCS - Technol… 2705 M… Wash… DC    20032 ((-76.99592 38.84915, -7…
 6 Friendship PCS - Blow Pi… 725 19… Wash… DC    20002 ((-76.97583 38.89864, -7…
 7 Friendship PCS - Blow Pi… 725 19… Wash… DC    20002 ((-76.97583 38.89864, -7…
 8 Friendship PCS - Chamber… 1345 P… Wash… DC    20003 ((-76.9878 38.87966, -76…
 9 Friendship PCS - Woodrid… 2959 C… Wash… DC    20018 ((-76.96495 38.92914, -7…
10 Friendship PCS - Woodrid… 2959 C… Wash… DC    20018 ((-76.96495 38.92914, -7…
# ℹ 229 more rows

Let’s visualize it. It’s hard to see the difference from this map, but those purple circles represent a 100 meter radius circle around our school point.

ggplot() +
  geom_sf(data=washington_shapefile, fill="white") + 
  geom_sf(data=washington_schools_buffered, color="purple") 

It’s easier to see with a leaflet map that shows both the points and the buffered circle. Zoom in.

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data=washington_schools_buffered, weight=1, fill="purple") %>%
  addCircles(data=washington_schools,
             label=washington_schools$school_name)
Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
Need '+proj=longlat +datum=WGS84'

We can also see the difference in the geometry column between our buffered dataframe and our point dataframe.

Our washington_schools dataframe is a collection of points.

washington_schools$geometry
Geometry set for 239 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -77.10065 ymin: 38.82376 xmax: -76.92001 ymax: 38.98462
Geodetic CRS:  NAD83
First 5 geometries:
POINT (-76.94906 38.90134)
POINT (-76.94906 38.90134)
POINT (-76.94631 38.89751)
POINT (-76.997 38.84754)
POINT (-76.99629 38.8483)

Our washington_schools_buffered dataframe is a collection of polygons

washington_schools_buffered$geometry
Geometry set for 239 features 
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -77.10182 ymin: 38.82285 xmax: -76.91885 ymax: 38.98553
Geodetic CRS:  NAD83
First 5 geometries:
POLYGON ((-76.94809 38.90084, -76.94808 38.9008...
POLYGON ((-76.94809 38.90084, -76.94808 38.9008...
POLYGON ((-76.94558 38.89821, -76.94558 38.8982...
POLYGON ((-76.99599 38.84797, -76.99599 38.8479...
POLYGON ((-76.99592 38.84915, -76.99592 38.8491...

Now that we have a buffered schools file, we can redo our spatial join.

schools_homicides <- washington_schools_buffered %>%
  st_join(washington_homicides, left="FALSE")

schools_homicides
Simple feature collection with 87 features and 15 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -77.04171 ymin: 38.82285 xmax: -76.91885 ymax: 38.9614
Geodetic CRS:  NAD83
# A tibble: 87 × 16
   school_name      address city.x state.x zip                    geometry uid  
 * <chr>            <chr>   <chr>  <chr>   <chr>             <POLYGON [°]> <chr>
 1 Friendship PCS … 4095 M… Washi… DC      20019 ((-76.94558 38.89821, -7… Was-…
 2 Friendship PCS … 4095 M… Washi… DC      20019 ((-76.94558 38.89821, -7… Was-…
 3 Friendship PCS … 645 MI… Washi… DC      20032 ((-76.99599 38.84797, -7… Was-…
 4 Friendship PCS … 1351 N… Washi… DC      20011 ((-77.03121 38.95965, -7… Was-…
 5 Friendship PCS … 645 MI… Washi… DC      20032 ((-76.99599 38.84806, -7… Was-…
 6 Moten ES         1565 M… Washi… DC      20020 ((-76.98158 38.8566, -76… Was-…
 7 Stanton ES       2701 N… Washi… DC      20020 ((-76.96953 38.85878, -7… Was-…
 8 Sousa MS         3650 E… Washi… DC      20019 ((-76.95386 38.88319, -7… Was-…
 9 Savoy ES         2400 S… Washi… DC      20020 ((-76.99378 38.86371, -7… Was-…
10 Marie Reed ES    2201 1… Washi… DC      20009 ((-77.0414 38.91974, -77… Was-…
# ℹ 77 more rows
# ℹ 9 more variables: reported_date <dbl>, victim_last <chr>,
#   victim_first <chr>, victim_race <chr>, victim_age <chr>, victim_sex <chr>,
#   city.y <chr>, state.y <chr>, disposition <chr>

We get 87 rows. This does NOT mean:

  • We had 87 homicides near schools. Some individual homicides occurred within 100 meters of more than one school.
  • There were homicides near 87 schools. Some schools had multiple homicides.

To answer our original question, “how many homicides occurred near (within 100 meters) of a DC school?”, we need to distinct our list of schools and count them.

schools_homicides %>%
  distinct(school_name) %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1    65

There’s our answer. Out of 239 DC schools, 65 – more than 25 percent – had a homicide occur within 100 meters.

Question #1: Answer this question in English: Write in Elms

Develop two questions that you could answer with a spatial buffer as described in this chapter, outside of the domain of education and crime. Briefly describe the two geographic dataframes you would need to answer each question.